Day3

Uroš Godnov

Tidyverse

Tidyverse - 1

  • the tidyverse is an opinionated collection of R packages designed for data science
  • all packages share an underlying philosophy (‘tidy’) and common APIs

Tidyverse - 2

Tidyverse - 3

Tidyverse - 4

Dplyr

  • Hadley Wickham
  • transforming data
  • writing grammar with pipe operator (%>%)
  • pipe operator in magrittr
  • spread and gather in tidyr

Grammar - 1

  • select: return a subset of the columns of a data frame, using a flexible notation
  • filter: extract a subset of rows from a data frame based on logical conditions
  • arrange: reorder rows of a data frame
  • rename: rename variables in a data frame

Grammar - 2

  • group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”. ungroup() removes grouping
  • mutate: add new variables/columns or transform existing variables
  • summarise / summarize: generate summary statistics of different variables in the data frame, possibly within strata
  • %>%: the “pipe” operator is used to connect multiple verb actions together into a pipeline

Select - 1

  • select(data.frame, columns)
  • select(data.frame, column1:column3) – selects col1, col2, col3
  • select(data.frame, -(col1:col3)) – selects all columns except col1 to col3
  • starts_with() and ends_with()
  • contains()

Select - 2

  • using across to specify conditions
  • in the past select_if, select_all
df1<-iris%>%select_if(is.factor)
head(df1,2)
  Species
1  setosa
2  setosa
df1<-iris%>%select(where(is.factor))
head(df1,2)
  Species
1  setosa
2  setosa
df2<-df1%>%select_all(toupper)
head(df2,2)
  SPECIES
1  setosa
2  setosa
df2 <- df1 %>%
  rename_with(toupper, everything())

head(df2,2)
  SPECIES
1  setosa
2  setosa

Complex select - 1

  • where replaces if
  • combining where with any_of or all_of
df1<-iris%>%select(where(is.factor))
head(df1,2)
  Species
1  setosa
2  setosa

Complex select - 2

  • where & any_of
# A tibble: 3 × 4
  name      age salary department
  <chr>   <dbl>  <dbl> <chr>     
1 Alice      23  50000 Sales     
2 Bob        35  60000 Marketing 
3 Charlie    45  70000 IT        
interested_cols <- c("age", "salary", "experience")

# Select columns that are numeric and whose names are in interested_cols
result <- df %>%
  select(
    where(is.numeric) & any_of(interested_cols)
  )

result
# A tibble: 3 × 2
    age salary
  <dbl>  <dbl>
1    23  50000
2    35  60000
3    45  70000

Complex select - 3

  • where & all_of
  • all_of is strict
interested_cols <- c("age", "salary", "experience")

# Select columns that are numeric and whose names are in interested_cols
result <- df %>%
  select(
    where(is.numeric) & all_of(interested_cols)
  )
Error in `all_of()`:
! Can't subset columns that don't exist.
✖ Column `experience` doesn't exist.

%>%

  • pipe operator
  • stringing together multiple dplyr functions in a sequence of operations first(x) %>% second %>% third
  • from R 4.1 you have native pipe |>

Arrange

  • orders dataframe according to variables
  • arrange(data.frame, col1)
  • arrange(data.frame, desc(col1))
iris%>%select(contains("Sepal"))%>%arrange(desc(Sepal.Width))%>%head(3)
  Sepal.Length Sepal.Width
1          5.7         4.4
2          5.5         4.2
3          5.2         4.1

Rename - 1

  • renames columns
  • rename(dataframe, newcol1=col1,newcol2=col2)
  • new name is on the left side
iris%>%select(contains("Sepal"))%>%arrange(desc(Sepal.Width))%>%
  rename(SepalW=Sepal.Width,
         SepalL=Sepal.Length)%>%head(3)
  SepalL SepalW
1    5.7    4.4
2    5.5    4.2
3    5.2    4.1

Rename - 2

  • rename_with(.data, .fn, .cols = everything(), …)
colName<-function(x) {
  tmp<-stringr::str_sub(x,1,7)
  return(gsub("\\.","",tmp))}
iris %>% 
  rename_with(colName, contains("Sepal")) %>% 
    head(3)
  SepalL SepalW Petal.Length Petal.Width Species
1    5.1    3.5          1.4         0.2  setosa
2    4.9    3.0          1.4         0.2  setosa
3    4.7    3.2          1.3         0.2  setosa

Rename (advanced) - 3

# A tibble: 3 × 3
  old_name1 old_name2 old_name3
      <dbl> <chr>     <lgl>    
1         1 A         TRUE     
2         2 B         FALSE    
3         3 C         TRUE     
# A tibble: 3 × 2
  old_name  new_name 
  <chr>     <chr>    
1 old_name1 new_name1
2 old_name2 new_name2
3 old_name3 new_name3
df_renamed <- df %>%
  rename_with(
    ~ name_mapping_df$new_name[match(.x, name_mapping_df$old_name)],
    .cols = everything()
  )

df_renamed
# A tibble: 3 × 3
  new_name1 new_name2 new_name3
      <dbl> <chr>     <lgl>    
1         1 A         TRUE     
2         2 B         FALSE    
3         3 C         TRUE     

Summarise - 1

  • create one or more scalar variables summarizing the variables of an existing tb
  • powerfull with group_by clause
iris%>%select(-Species)%>%
  summarise(numberOfRows=n())
  numberOfRows
1          150
iris%>%select(-Species)%>%
  summarise(median=median(Petal.Length),
            mean=mean(Petal.Length))
  median  mean
1   4.35 3.758

Summarise - 2

  • summarise_at(): the input is character vector
  • summarise_if(): using predicate function
  • superseded with across function in dplyr v1.0.0
columns<-c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") 

iris%>%summarise_at(columns,.funs=list(mean, median))
  Sepal.Length_fn1 Sepal.Width_fn1 Petal.Length_fn1 Petal.Width_fn1
1         5.843333        3.057333            3.758        1.199333
  Sepal.Length_fn2 Sepal.Width_fn2 Petal.Length_fn2 Petal.Width_fn2
1              5.8               3             4.35             1.3
iris%>%summarise_if(is.numeric,.funs=list(mean, median))
  Sepal.Length_fn1 Sepal.Width_fn1 Petal.Length_fn1 Petal.Width_fn1
1         5.843333        3.057333            3.758        1.199333
  Sepal.Length_fn2 Sepal.Width_fn2 Petal.Length_fn2 Petal.Width_fn2
1              5.8               3             4.35             1.3

Summarise - 3

  • across: applies funtion(s) to a set of colummns
  • across(.cols = everything(), .fns = NULL, …, .names = NULL)
columns<-c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") 
iris%>%summarise(across(.cols=all_of(columns),.fns=list(mean, median)))
  Sepal.Length_1 Sepal.Length_2 Sepal.Width_1 Sepal.Width_2 Petal.Length_1
1       5.843333            5.8      3.057333             3          3.758
  Petal.Length_2 Petal.Width_1 Petal.Width_2
1           4.35      1.199333           1.3
iris%>%summarise(across(.cols=is.numeric,.fns=list(mean, median)))
  Sepal.Length_1 Sepal.Length_2 Sepal.Width_1 Sepal.Width_2 Petal.Length_1
1       5.843333            5.8      3.057333             3          3.758
  Petal.Length_2 Petal.Width_1 Petal.Width_2
1           4.35      1.199333           1.3

Mutate - 1

  • computes new variables
  • related verb transmute() which drops non transformed variables
iris %>% 
  mutate(AverageLength=(Sepal.Length+Petal.Length)/2,
              AverageWidth=(Sepal.Width+Petal.Width)/2)%>%
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species AverageLength
1          5.1         3.5          1.4         0.2  setosa          3.25
2          4.9         3.0          1.4         0.2  setosa          3.15
3          4.7         3.2          1.3         0.2  setosa          3.00
  AverageWidth
1         1.85
2         1.60
3         1.70

Mutate - 2

  • mutate_at, mutate_if
  • mutate+across
iris%>%mutate_if(is.numeric, as.character)%>%head(.,3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9           3          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
iris%>%mutate(across(.cols=is.numeric, .fns=as.character))%>%head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9           3          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

Lab

  • open dplyr.txt and complete the excercises

Group_by

  • used to generate summary statistics from the data frame within strata defined by a variable
iris%>%group_by(Species)%>%summarise(across(.fns=list(mean, median)))
# A tibble: 3 × 9
  Species    Sepal.Length_1 Sepal.Length_2 Sepal.Width_1 Sepal.Width_2
  <fct>               <dbl>          <dbl>         <dbl>         <dbl>
1 setosa               5.01            5            3.43           3.4
2 versicolor           5.94            5.9          2.77           2.8
3 virginica            6.59            6.5          2.97           3  
# ℹ 4 more variables: Petal.Length_1 <dbl>, Petal.Length_2 <dbl>,
#   Petal.Width_1 <dbl>, Petal.Width_2 <dbl>

Slice helpers - 1

  • top_n(), sample_n(), sample_frac()
  • v1.0.0: slice_min(), slice_max(), slice_head(), slice_tail(), slice_sample()
iris%>%top_n(2, wt=Petal.Width)
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          6.3         3.3          6.0         2.5 virginica
2          7.2         3.6          6.1         2.5 virginica
3          6.7         3.3          5.7         2.5 virginica
iris%>%slice_max(order_by = Petal.Width, n=2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          6.3         3.3          6.0         2.5 virginica
2          7.2         3.6          6.1         2.5 virginica
3          6.7         3.3          5.7         2.5 virginica

Slice helpers - 2

  • without ties
iris%>%slice_max(order_by = Petal.Width, n=2, with_ties=FALSE)
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          6.3         3.3          6.0         2.5 virginica
2          7.2         3.6          6.1         2.5 virginica
  • random selection
iris%>%sample_frac(size=0.03)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.4         3.9          1.7         0.4     setosa
2          6.1         2.9          4.7         1.4 versicolor
3          6.4         3.2          4.5         1.5 versicolor
4          5.8         2.8          5.1         2.4  virginica

Slice helpers - 3

  • random selection with slice_sample()
iris%>%slice_sample(prop=0.03)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.1         3.8          1.6         0.2     setosa
2          6.7         3.3          5.7         2.1  virginica
3          6.4         2.9          4.3         1.3 versicolor
4          6.1         2.8          4.0         1.3 versicolor

Lab

  • import Master.csv
  • number of players by year of birth n()
  • number of players per birthCountry (asc)
  • average weight in kg per birthCountry (desc) where number of per birthcountry>5 help:
    • select birthyear
    • group by
    • summarise (use function n())
    • arrange

Window rank functions - 1

  • row_number
  • ntile
  • min_rank
  • dense_rank
df<-iris%>%slice_sample(n=10)
df%>%mutate(id=row_number(), groups=ntile(Species,5),
            minrank=min_rank(Species), denserank=dense_rank(Species))%>%
  select(-contains("Petal"),-contains("Sepal"),)
      Species id groups minrank denserank
1  versicolor  1      1       2         2
2  versicolor  2      2       2         2
3  versicolor  3      2       2         2
4   virginica  4      4       8         3
5      setosa  5      1       1         1
6  versicolor  6      3       2         2
7   virginica  7      5       8         3
8   virginica  8      5       8         3
9  versicolor  9      3       2         2
10 versicolor 10      4       2         2

Window rank functions - 2

  • how to add row_numbers in every group by decreasig size of Petal.Length
  • how to get the second largest?
df<-iris%>%slice_sample(n=10)
df%>%group_by(Species)%>%arrange(desc(Petal.Length))%>%
  mutate(id=row_number())%>%
  ungroup()%>%
  select(Petal.Length, Species, id)%>%arrange(Species)
# A tibble: 10 × 3
   Petal.Length Species       id
          <dbl> <fct>      <int>
 1          1.9 setosa         1
 2          1.9 setosa         2
 3          1.7 setosa         3
 4          1.4 setosa         4
 5          4.4 versicolor     1
 6          4   versicolor     2
 7          6.3 virginica      1
 8          5.7 virginica      2
 9          5.5 virginica      3
10          5.3 virginica      4

Lab

  • import Master.csv
  • show the second heaviest man from the first 5 contries with the largest average weight in kg per birthCountry (desc) where number of per birthcountry>5

Pivot_longer and pivot_wider (tidyr) - 1

  • successors to spread and gather
  • pivot_wider
df<-airquality%>%select(Month, Day, Temp)%>%dplyr::filter(Month %in% c(5,6) & Day<4)%>%
  pivot_wider(names_prefix = "Month ", names_from=Month, values_from=Temp)
df
# A tibble: 3 × 3
    Day `Month 5` `Month 6`
  <int>     <int>     <int>
1     1        67        78
2     2        72        74
3     3        74        67

Pivot_longer and pivot_wider (tidyr) - 2

  • pivot_longer
df<-data.frame(day=c("Monday", "Tuesday","Wednesday"),month_aug=c(46,76,32),
               month_sep=c(62,67,23), month_oct=c(43,NA,31))
df
        day month_aug month_sep month_oct
1    Monday        46        62        43
2   Tuesday        76        67        NA
3 Wednesday        32        23        31
df%>%pivot_longer(cols=month_aug:month_oct, names_to = "Month", 
                  values_to = "Temperature")%>%
  head(6)
# A tibble: 6 × 3
  day     Month     Temperature
  <chr>   <chr>           <dbl>
1 Monday  month_aug          46
2 Monday  month_sep          62
3 Monday  month_oct          43
4 Tuesday month_aug          76
5 Tuesday month_sep          67
6 Tuesday month_oct          NA

Pivot_longer and pivot_wider (tidyr) - 3

  • using names_prefix and values_drop_na
df %>% 
  pivot_longer(cols=month_aug:month_oct,names_prefix="month_", 
                  names_to = "Month", values_to = "Temperature", values_drop_na=TRUE) %>%
  head(6)
# A tibble: 6 × 3
  day       Month Temperature
  <chr>     <chr>       <dbl>
1 Monday    aug            46
2 Monday    sep            62
3 Monday    oct            43
4 Tuesday   aug            76
5 Tuesday   sep            67
6 Wednesday aug            32

Pivot_longer and pivot_wider (tidyr) - 4

  • using pivot_longer with values_transform
# A tibble: 3 × 4
  name    height_cm weight_kg class
  <chr>       <dbl>     <dbl> <chr>
1 Alice         170        65 A    
2 Bob           180        75 A    
3 Charlie       175        68 B    
df_long <- df %>%
  pivot_longer(
    cols = c(height_cm, weight_kg, class),
    names_to = "measurement_type",
    values_to = "value"
  )
Error in `pivot_longer()`:
! Can't combine `height_cm` <double> and `class` <character>.
df_long
Error in eval(expr, envir, enclos): object 'df_long' not found

Pivot_longer and pivot_wider (tidyr) - 5

df_long <- df %>%
  pivot_longer(
    cols = c(height_cm, weight_kg, class),
    names_to = "measurement_type",
    values_to = "value",
    values_transform = list(value=as.character)
  )

df_long
# A tibble: 9 × 3
  name    measurement_type value
  <chr>   <chr>            <chr>
1 Alice   height_cm        170  
2 Alice   weight_kg        65   
3 Alice   class            A    
4 Bob     height_cm        180  
5 Bob     weight_kg        75   
6 Bob     class            A    
7 Charlie height_cm        175  
8 Charlie weight_kg        68   
9 Charlie class            B    

Lab

  • import norway_new_car_sales_by_model.xlsx
  • show the number of sold cars by manufacturer (rows) and years(columns)
  • use spread and pivot_wider

fill() - 1

  • carry-forward function
data <- tibble(
  time = 1:5,
  measurement = c(NA, 3, NA, NA, 5)
)

data
# A tibble: 5 × 2
   time measurement
  <int>       <dbl>
1     1          NA
2     2           3
3     3          NA
4     4          NA
5     5           5

fill() - 2

filled_data <- data %>%
  fill(measurement, .direction = "down")

filled_data
# A tibble: 5 × 2
   time measurement
  <int>       <dbl>
1     1          NA
2     2           3
3     3           3
4     4           3
5     5           5

User defined functions in dplyr - 1

  • using in mutate

  • function has to be vectorized; if not, use rowwise() to go row by row

  • vectorized function

fTOc<-function(x) {round((x-32)*5/9,0)}
input<-c(100, 102, 99)
fTOc(input)
[1] 38 39 37
  • nonvectorized function
stupidFunction<-function(x,y){return(sum(x,y))}

airquality%>%mutate(something=stupidFunction(Month, Day))%>%head(3)
  Ozone Solar.R Wind Temp Month Day something
1    41     190  7.4   67     5   1      3488
2    36     118  8.0   72     5   2      3488
3    12     149 12.6   74     5   3      3488

User defined functions in dplyr - 2

  • using rowwise() operation
airquality%>%rowwise()%>%mutate(something=stupidFunction(Month, Day))%>%head(3)
# A tibble: 3 × 7
# Rowwise: 
  Ozone Solar.R  Wind  Temp Month   Day something
  <int>   <int> <dbl> <int> <int> <int>     <int>
1    41     190   7.4    67     5     1         6
2    36     118   8      72     5     2         7
3    12     149  12.6    74     5     3         8
  • or add id and use group_by(id)
airquality%>%mutate(id=row_number())%>%group_by(id)%>%
  mutate(something=stupidFunction(Month, Day))%>%ungroup()%>%head(3)
# A tibble: 3 × 8
  Ozone Solar.R  Wind  Temp Month   Day    id something
  <int>   <int> <dbl> <int> <int> <int> <int>     <int>
1    41     190   7.4    67     5     1     1         6
2    36     118   8      72     5     2     2         7
3    12     149  12.6    74     5     3     3         8

User defined functions in dplyr - 3

  • use Vectorize() function
stupidFunction_v<-Vectorize(stupidFunction)
airquality%>%  
  mutate(something=stupidFunction_v(Month, Day))%>%head(5)
  Ozone Solar.R Wind Temp Month Day something
1    41     190  7.4   67     5   1         6
2    36     118  8.0   72     5   2         7
3    12     149 12.6   74     5   3         8
4    18     313 11.5   62     5   4         9
5    NA      NA 14.3   56     5   5        10

Lab

  • use Titanic dataset
  • write a function, which will return the following:
    • if a person was a child and didn’t survive, the function should return “Poor child”
    • if a person was a woman and didn’t survive, the function should return “Oh no”
    • if a person was an adult man and did survive, the function should return “You shouldn’t save you if there were still women and children aboard”

Lab

  • take dataset mtcars
  • write a function which will tranform mpg to l/100 kms and assign values to a new column consumption
  • display mean and median for mpg, hp, consumption
  • identify the worst and best economic car

Join - 1

  • inner_join
  • left_join
  • right_join
  • full_join

Join - 2

#| echo: false

include_graphics("./Pictures/join.jpg")

Join - 3

  • semi_join
  • anti_join
#| echo: false
include_graphics("./Pictures/semijoin.jpg")

Examples

band_members%>%head()
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
band_instruments%>%head()
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

Examples

band_members %>% inner_join(band_instruments)
# A tibble: 2 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  
band_members %>% left_join(band_instruments)
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  

Examples

band_members %>% right_join(band_instruments)
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  
3 Keith <NA>    guitar
band_members %>% full_join(band_instruments)
# A tibble: 4 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
4 Keith <NA>    guitar

Examples

  • A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x
band_members %>% semi_join(band_instruments)
# A tibble: 2 × 2
  name  band   
  <chr> <chr>  
1 John  Beatles
2 Paul  Beatles
band_members %>% anti_join(band_instruments)
# A tibble: 1 × 2
  name  band  
  <chr> <chr> 
1 Mick  Stones

Joining by many columns, different names

  • inner_join(df1, df2, by=c(“a1”=“b2”,“z3”=“u1”))
d1 <- data_frame(
  x = letters[1:3],
  y = LETTERS[1:3],
  a = rnorm(3)
  )

d2 <- data_frame(
  x2 = letters[3:1],
  y2 = LETTERS[3:1],
  b = rnorm(3)
  )

left_join(d1, d2, by = c("x" = "x2", "y" = "y2"))
# A tibble: 3 × 4
  x     y          a      b
  <chr> <chr>  <dbl>  <dbl>
1 a     A     -0.370 -0.227
2 b     B      0.897 -1.61 
3 c     C     -0.682 -1.55 

Lab

  • download csv 2019 version of data from http://www.seanlahman.com/baseball-archive/statistics/
  • extract master.csv and FieldingOF.csv
  • meta data are available at http://www.seanlahman.com/files/database/readme2019.txt
  • Display the name (firstname+lastname) of the player who had the second highest number of games played in center field for each year from 1990 to 2000

Lag and lead

  • find the “previous” (lag()) or “next” (lead()) values in a vector
r<-data.frame(year=2005:2014,population=sample(14000:15000, 10, replace=T))
r<-cbind(r,lag(r$population))
names(r)<-c("year","pop","pop1")
r%>%
  mutate(index=round(100*(1+(pop-pop1)/pop1),2))%>%
  select(year,pop,pop1,index)
   year   pop  pop1  index
1  2005 14138    NA     NA
2  2006 14760 14138 104.40
3  2007 14801 14760 100.28
4  2008 14433 14801  97.51
5  2009 14749 14433 102.19
6  2010 14209 14749  96.34
7  2011 14694 14209 103.41
8  2012 14049 14694  95.61
9  2013 14141 14049 100.65
10 2014 14214 14141 100.52

Nested ifelse or case_when - 1

  • nested ifelse
df<-as.data.frame(Titanic)%>%select(Class, Sex, Age, Survived)%>%
  dplyr::filter(Class!="Crew")
dfIfElse<-df%>%
mutate(Comment=ifelse(Age=="Child" & Survived=="No","No Child should die!",
              ifelse(Age=="Adult" & Sex=="Male" & Survived=="Yes",
                    "He should help women and children!","No comment")))%>%
  slice_sample(n=10)
dfIfElse
   Class    Sex   Age Survived                            Comment
1    1st Female Adult      Yes                         No comment
2    3rd   Male Child      Yes                         No comment
3    3rd   Male Adult       No                         No comment
4    3rd Female Adult      Yes                         No comment
5    2nd   Male Child       No               No Child should die!
6    1st Female Child       No               No Child should die!
7    1st   Male Adult      Yes He should help women and children!
8    2nd Female Child       No               No Child should die!
9    2nd   Male Child      Yes                         No comment
10   3rd Female Child      Yes                         No comment

Nested ifelse or case_when - 2

  • case_when
df%>%
  mutate(Comment=case_when(Age=="Child" & Survived=="No"~"No Child should die!",
  Age=="Adult" & Sex=="Male" & Survived=="Yes"~"He should help women and children!",
  TRUE~"No comment"))%>%
  slice_sample(n=10)
   Class    Sex   Age Survived                            Comment
1    3rd Female Child      Yes                         No comment
2    1st   Male Adult      Yes He should help women and children!
3    2nd Female Child       No               No Child should die!
4    1st Female Adult      Yes                         No comment
5    2nd Female Adult      Yes                         No comment
6    2nd Female Adult       No                         No comment
7    2nd   Male Child       No               No Child should die!
8    3rd Female Adult       No                         No comment
9    1st Female Child       No               No Child should die!
10   1st Female Adult       No                         No comment

Unite and separate - 1

  • unite combines multiple columns into a single column
data<-as.data.frame(HairEyeColor)
head(data,4)
   Hair   Eye  Sex Freq
1 Black Brown Male   32
2 Brown Brown Male   53
3   Red Brown Male   10
4 Blond Brown Male    3
  • uniting Hair, Eye and Sex into Properties column
data%>%unite("Properties",Hair:Sex, sep="/")%>%
  head(4)
        Properties Freq
1 Black/Brown/Male   32
2 Brown/Brown/Male   53
3   Red/Brown/Male   10
4 Blond/Brown/Male    3

Unite and separate - 2

  • separate turns a single character column into multiple columns
  • splitting the values of the column wherever a separator character appears
df<-data%>%unite("Properties",Hair:Sex, sep="/")
df%>%separate(Properties, into=c("A","B","C"),sep="/")%>%
  head(3)
      A     B    C Freq
1 Black Brown Male   32
2 Brown Brown Male   53
3   Red Brown Male   10
  • what about in this case?
df<-data.frame(name=c("Sam Jones Sr. Mr.",
                      "Lady Gaga Singer",
                      "Valentino Rossi Mr. (Junior) son")); df
                              name
1                Sam Jones Sr. Mr.
2                 Lady Gaga Singer
3 Valentino Rossi Mr. (Junior) son

Unite and separate - 3

  • separate has extra atribute, which decides, what to with “extra data”
df%>%separate(name, into=c("First_name","Last_Name"), extra="drop")
  First_name Last_Name
1        Sam     Jones
2       Lady      Gaga
3  Valentino     Rossi
  • fill: what happens when there are not enough pieces
df%>%separate(name,
into=c("First_name","Last_Name","Something","Title"), extra="drop", fill="left")
  First_name Last_Name Something  Title
1        Sam     Jones        Sr     Mr
2       <NA>      Lady      Gaga Singer
3  Valentino     Rossi        Mr Junior

Nest and unnest

  • nesting creates a list-column of data frames
  • nesting is implicitly a summarising operation
  • convenient for bulding the models
df<-iris%>%nest(-Species)
df
# A tibble: 3 × 2
  Species    data             
  <fct>      <list>           
1 setosa     <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
3 virginica  <tibble [50 × 4]>
df%>%unnest()%>%head(2)
# A tibble: 2 × 5
  Species Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>          <dbl>       <dbl>        <dbl>       <dbl>
1 setosa           5.1         3.5          1.4         0.2
2 setosa           4.9         3            1.4         0.2

Lab

1. Take airmiles dataset and transform dataframe to a column with name Airmiles. 
Add a new column called Year from 1937 to 1960.
Calculate index for Airmiles. Which year has the highest index.
2. Take dates_df dataframe and split date column into columns month, day and year!
dates_df <- data.frame(date = c("5/24/1930",
                                "5/25/1930",
                                "5/26/1930",
                                "5/27/1930",
                                "5/28/1930"),
                       stringsAsFactors = FALSE)
3. Create 100 random numbers between 1 and 95 and transform it to a column Age in a dataframe 
People. 
Discretize Age into the following age groups:
- 1:20
- 21:45
- 46:55
- 56:70
- 71:
Calculate summary statistics. 

map family functions - 1

  • purrr package
  • map familiy functions transform their input by applying a function to each element of a list or atomic vector and returning an object of the same length as the input
  • map(): returns a list
  • map_lgl(), map_int(), map_dbl() and map_chr() return an atomic vector of the indicated type
  • map_df(), map_dfr() and map_dfc() return a data frame created by row-binding and column-binding
  • there are also map_if and map_at functions, where logic is the same as with _if and _at functions till now.
airquality%>%select(Ozone)%>%
  map(mean)
$Ozone
[1] NA

map family functions - 2

  • one function, no arguments: map
  • one function with 2 arguments: map2
  • one function with many arguments: pmap

map family functions - 3

  • passing a parameter
airquality%>%select(Temp)%>%
  map(mean)
$Temp
[1] 77.88235

map family functions - 4

  • reading files with the same structure, map_df
files<-list.files(pattern = "*.csv", path = "./Mtcars/",  full.names = TRUE)

allDF<-map_df(files, read.csv2)
  • calling more than 1 function in map
  • invoke, invoke_map: retired
airquality%>%select(Ozone)%>%invoke_map(.f=c("mean", "max"),na.rm=TRUE)
[[1]]
[1] 41

[[2]]
[1] 168

map family functions - 5

airquality%>%select(Month, Day)%>%mutate(Something=map2_int(Month,Day,sum))%>%
  head(3)%>%as_tibble()
# A tibble: 3 × 3
  Month   Day Something
  <int> <int>     <int>
1     5     1         6
2     5     2         7
3     5     3         8
airquality%>%select(Month, Day)%>%mutate(Something=map2_chr(Month,Day,sum))%>%
  head(3)%>%as_tibble()
# A tibble: 3 × 3
  Month   Day Something
  <int> <int> <chr>    
1     5     1 6        
2     5     2 7        
3     5     3 8        

map family functions - 6

  • map handles vectorization
stupidFunction<-function(x,y){return(sum(x,y))}

airquality%>%select(Month, Day)%>%
  mutate(Something=map2_chr(Month,Day,stupidFunction))%>%
  head(3)%>%as_tibble()
# A tibble: 3 × 3
  Month   Day Something
  <int> <int> <chr>    
1     5     1 6        
2     5     2 7        
3     5     3 8        

map family functions - 7

  • pmap: arbitrary number of arguments
airquality%>%select(Wind,Temp, Month, Day)%>%
  mutate(Something=pmap_dbl(list(Wind,Temp,Month,Day),sum))%>%
  head(3)%>%as_tibble()
# A tibble: 3 × 5
   Wind  Temp Month   Day Something
  <dbl> <int> <int> <int>     <dbl>
1   7.4    67     5     1      80.4
2   8      72     5     2      87  
3  12.6    74     5     3      94.6
airquality%>%select(Wind,Temp, Month, Day)%>%
  mutate(Something=pmap_chr(list(Wind,Temp,Month,Day),sum))%>%
  head(3)%>%as_tibble()
# A tibble: 3 × 5
   Wind  Temp Month   Day Something
  <dbl> <int> <int> <int> <chr>    
1   7.4    67     5     1 80.400000
2   8      72     5     2 87.000000
3  12.6    74     5     3 94.600000

map family functions - 8

  • passing additional arguments to function
airquality%>%select(Ozone,Day)%>%dplyr::filter(is.na(Ozone))%>%
  mutate(Something=map2_int(Ozone,Day,max))%>%
  head(3)%>%as_tibble()
# A tibble: 3 × 3
  Ozone   Day Something
  <int> <int>     <int>
1    NA     5        NA
2    NA    10        NA
3    NA    25        NA
airquality%>%select(Ozone,Day)%>%dplyr::filter(is.na(Ozone))%>%
  mutate(Something=map2_int(Ozone,Day,max, na.rm=TRUE))%>%
  head(3)%>%as_tibble()
# A tibble: 3 × 3
  Ozone   Day Something
  <int> <int>     <int>
1    NA     5         5
2    NA    10        10
3    NA    25        25

map family functions - 9

  • passing more than one argument to called function
files<-list.files(pattern = "*.csv", path = "./Mtcars/",  full.names = TRUE)

allDF<-map_df(files, read.csv2, skip=1,header=FALSE)

Example of more complex use - 1

  • creating a linear model for every number of cyl
  • when functions get more complex there are basically two ways to call them either with the tilde notation or with a normal anonymous function
call1<-mtcars%>%select(cyl,mpg,wt)%>%
  nest(-cyl)%>%mutate(model=map(data,function(x) lm(formula=mpg~wt,data=x)))

call2<-mtcars%>%select(cyl,mpg,wt)%>%
  nest(-cyl)%>%mutate(model=map(data,~lm(formula=mpg~wt,data=.x)))
call1
# A tibble: 3 × 3
    cyl data              model 
  <dbl> <list>            <list>
1     6 <tibble [7 × 2]>  <lm>  
2     4 <tibble [11 × 2]> <lm>  
3     8 <tibble [14 × 2]> <lm>  

Example of more complex use - 2

  • pluck
call1%>%pluck("data",1)
# A tibble: 7 × 2
    mpg    wt
  <dbl> <dbl>
1  21    2.62
2  21    2.88
3  21.4  3.22
4  18.1  3.46
5  19.2  3.44
6  17.8  3.44
7  19.7  2.77

Example of more complex use - 3

  • making tidy models
call1%>%mutate(tidyModel=map(model, tidy))%>%
  unnest(tidyModel)
# A tibble: 6 × 8
    cyl data              model  term       estimate std.error statistic p.value
  <dbl> <list>            <list> <chr>         <dbl>     <dbl>     <dbl>   <dbl>
1     6 <tibble [7 × 2]>  <lm>   (Intercep…    28.4      4.18       6.79 1.05e-3
2     6 <tibble [7 × 2]>  <lm>   wt            -2.78     1.33      -2.08 9.18e-2
3     4 <tibble [11 × 2]> <lm>   (Intercep…    39.6      4.35       9.10 7.77e-6
4     4 <tibble [11 × 2]> <lm>   wt            -5.65     1.85      -3.05 1.37e-2
5     8 <tibble [14 × 2]> <lm>   (Intercep…    23.9      3.01       7.94 4.05e-6
6     8 <tibble [14 × 2]> <lm>   wt            -2.19     0.739     -2.97 1.18e-2

Lab

  • Write a function called n_unique, which will return the number of distinct values. Apply this function to a dataframe mtcars. Which column has a highest variability (most different distinct values)?
  • Use the appropriate map() function to:
    • Compute the standard deviation of every column in a numeric data frame (mtcars)
    • Compute the standard deviation of every numeric column in a mixed data frame(iris)
    • Compute the number of levels for every factor in a data frame (iris)

Tabular representation with gt()

  • gt package

Simple gt object

df<-sp500 %>%
  dplyr::filter(between(date,ymd("2015-12-24"),ymd("2015-12-31"))) %>%
  select(-adj_close) %>% arrange(date) 
  df %>% gt()
date open high low close volume
2015-12-24 2063.52 2067.36 2058.73 2060.99 1411860000
2015-12-28 2057.77 2057.77 2044.20 2056.50 2492510000
2015-12-29 2060.54 2081.56 2060.54 2078.36 2542000000
2015-12-30 2077.34 2077.34 2061.97 2063.36 2367430000
2015-12-31 2060.59 2062.54 2043.62 2043.94 2655330000

Adding title and formating columns

  • fmt_*
dfGT<-df %>% gt() %>%
  tab_header(title="SP500", 
             subtitle="Last week of 2015") %>% 
   fmt_date(columns = vars(date),date_style = 7) %>% 
   fmt_currency(columns = vars(open, high, low, close),
    currency = "EUR")

dfGT
SP500
Last week of 2015
date open high low close volume
24 Dec 2015 €2,063.52 €2,067.36 €2,058.73 €2,060.99 1411860000
28 Dec 2015 €2,057.77 €2,057.77 €2,044.20 €2,056.50 2492510000
29 Dec 2015 €2,060.54 €2,081.56 €2,060.54 €2,078.36 2542000000
30 Dec 2015 €2,077.34 €2,077.34 €2,061.97 €2,063.36 2367430000
31 Dec 2015 €2,060.59 €2,062.54 €2,043.62 €2,043.94 2655330000

Adding groups - 1

dfgt <- sp500 %>% 
  mutate(year=year(date),wday=wday(date, label=TRUE, abbr=FALSE, locale = "english"))%>%
  select(year, wday, high, low) %>% 
  group_by(year, wday) %>% 
  summarise(high=mean(high), low=mean(low)) %>% 
  dplyr::filter(year %in% c(2014,2015)) %>% 
  ungroup()

dfgt
# A tibble: 10 × 4
    year wday       high   low
   <dbl> <ord>     <dbl> <dbl>
 1  2014 Monday    1941. 1925.
 2  2014 Tuesday   1940. 1924.
 3  2014 Wednesday 1941. 1925.
 4  2014 Thursday  1933. 1914.
 5  2014 Friday    1939. 1923.
 6  2015 Monday    2072. 2050.
 7  2015 Tuesday   2070. 2047.
 8  2015 Wednesday 2072. 2049.
 9  2015 Thursday  2073. 2051.
10  2015 Friday    2072. 2050.

Adding groups - 2

  • rowname_col
dfgt %>% gt(rowname_col = "year") %>% 
  fmt_currency(columns=vars(high,low), currency = "EUR")
wday high low
2014 Monday €1,941.05 €1,924.95
2014 Tuesday €1,940.11 €1,924.20
2014 Wednesday €1,940.99 €1,925.14
2014 Thursday €1,932.90 €1,913.93
2014 Friday €1,939.08 €1,923.17
2015 Monday €2,071.82 €2,049.77
2015 Tuesday €2,070.48 €2,047.26
2015 Wednesday €2,071.74 €2,048.60
2015 Thursday €2,073.29 €2,051.18
2015 Friday €2,072.02 €2,049.98

Adding groups - 3

  • groupname_col
dfgtL<-dfgt %>% gt(groupname_col = "year") %>% fmt_currency(columns=vars(high,low),                           currency = "EUR")
dfgtL
wday high low
2014
Monday €1,941.05 €1,924.95
Tuesday €1,940.11 €1,924.20
Wednesday €1,940.99 €1,925.14
Thursday €1,932.90 €1,913.93
Friday €1,939.08 €1,923.17
2015
Monday €2,071.82 €2,049.77
Tuesday €2,070.48 €2,047.26
Wednesday €2,071.74 €2,048.60
Thursday €2,073.29 €2,051.18
Friday €2,072.02 €2,049.98

Adding summary labels - 1

  • summary_rows
dfgt %>% dplyr::filter(wday %in% c("Monday","Tuesday")) %>% 
  gt(groupname_col = "year") %>% fmt_currency(columns=vars(high,low),             
    currency = "EUR") %>% 
    summary_rows(
    columns = vars(high, low),
    fns = list(average = "mean"))
wday high low
2014
Monday €1,941.05 €1,924.95
Tuesday €1,940.11 €1,924.20
mean 1940.579 1924.574
2015
Monday €2,071.82 €2,049.77
Tuesday €2,070.48 €2,047.26
mean 2071.150 2048.514

Adding summary labels - 2

  • summary_rows by groups
dfgt %>% dplyr::filter(wday %in% c("Monday","Tuesday")) %>% 
  gt(groupname_col = "year", rowname_col ="wday") %>% 
  fmt_currency(columns=vars(high,low),currency = "EUR") %>% 
    summary_rows(
    groups = TRUE,
    columns = vars(high, low),
    fns = list(
      avg = ~mean(., na.rm = TRUE),
      total = ~sum(., na.rm = TRUE),
      s.d. = ~sd(., na.rm = TRUE)
    )
  )

Adding summary labels - 4

high low
2014
Monday €1,941.05 €1,924.95
Tuesday €1,940.11 €1,924.20
2015
Monday €2,071.82 €2,049.77
Tuesday €2,070.48 €2,047.26